Background
About 5 months ago, I tripped on 25I-NBOMe and had a psychedelic experience in which I realized how optimization algorithms manifest in evolution and intelligence. Thereafter, I became very aware of the various algorithms that I use in my everyday life to optimize the things that I value. One of the more concrete examples of optimization that I encounter is the physical optimization of the power level of an OPA (optical parametric amplifier) used in my spectroscopy lab. In optimizing said OPA, there are 7 parameters (2 knobs and 5 computer-controlled motors) that I can adjust. As with any optimization algorithm, each parameter can be thought of as a dimension along which I can move while probing some optimizable function.
This is how I optimize the OPA power:
1. Change a parameter by a certain amount.
2a. If power is higher than before, repeat (1).
2b. If power is lower than before, flip parameter change direction and reduce change amount by half. Repeat (1).
2c. If power seems unchanged, move to next parameter and repeat (1).
[Cease optimization when a suitable power level has been reached]
I began to think very deeply about the following question: Is there an optimal way to experimentally find the optimum of ANY n-dimensional space? Answering this question requires optimizing the optimization algorithm itself, so I started thinking about and programming a self-optimizing algorithm. I want this algorithm to be experimental in nature because I am primarily interested in optimization as it applies to life and consciousness. Algorithms such as gradient descent require knowledge of the underlying function, but we experience the world one piece at a time and thus must use methods that work when information is very limited. My idea is that there is a general template of an optimizable function into which any problem encountered in life can be cast, and that there is an associated general optimization algorithm with internal parameters that can themselves be optimized, such that the mean optimization result over all functions in the optimizable-function space is itself optimal.
I started writing the program in C++ but switched to Scilab for convenience and portability. The general optimization algorithm I have chosen is currently incomplete and is modeled on the method I use to optimize power levels in the lab. The optimization algorithm takes algorithm parameters, the function-to-be-optimized, and a starting point as input. The optimization parameters are:
1. starting step-size
2. step change
3. slope threshold
There are 2 fixed parameters that only determine when optimization ceases (they are not themselves meta-optimized):
1. delimiting slope
2. delimiting number of steps
The optimization algorithm is as follows:
(Start with dimension 1 and step-size = p(1))
1. Take a step of given magnitude (starting step-size for first step) along the current dimension.
2a. If function value decreases (algorithm seeks to find a minimum), repeat (1).
2b. If function value increases, reverse step direction and decrease step magnitude by some factor (step change). Repeat (1).
2c. If function value increases by less than the delimiting slope, cease optimization.
2d. If function value increases by less than the slope threshold, change to next dimension and go back to starting step-size. If current dimension is the last in line, decrease starting step-size by [step change] factor and go back to first dimension.
[Cease optimization when delimiting number of steps have been taken]
There are a few additional features that I think are logically desirable in the algorithm (a code sketch combining them with the core loop follows this list):
1. Step-size is increased by the [step change] factor if a specified number of steps have been taken in the same direction. Converging on the peak should be quick once the probe is within a range comparable to the step-size. This feature allows the probe to move more quickly in one direction if a peak is not found immediately.
2. Optimization will not cease until at least 1 round (optimization over all dimensions) has been completed.
3. Force positive probe location for meta-optimization. If the slope threshold parameter is negative, weird things happen.
4. Randomize direction between dimension changes. If this does not occur, certain directions are always preferred, and thus certain peak configurations will always yield a poor result.
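Below is a minimal sketch of the core loop combined with features 1, 2, and 4, written in Python for readability (my actual program is in Scilab). The names are my own, I interpret "slope" as the change in function value per unit step (my assumption, not necessarily the code's definition), and feature 3 belongs to the meta-optimization wrapper, so it is omitted here. Treat this as an illustration of the idea rather than the exact code:

import random

def optimize(f, x0, step0, step_change, slope_thresh,
             delim_slope=1e-6, max_steps=500, grow_after=3):
    # Coordinate-wise descent sketch. step0 = starting step-size,
    # step_change = shrink/grow factor, slope_thresh = per-step slope
    # below which we move to the next dimension. delim_slope and
    # max_steps are the two fixed (delimiting) parameters.
    x = list(x0)
    ndim = len(x)
    fx = f(x)
    dim = 0
    base_step = step = step0
    direction = random.choice([-1, 1])   # feature 4: randomized direction
    same_dir = 0
    rounds_done = 0
    for n in range(1, max_steps + 1):
        x[dim] += direction * step       # 1. step along current dimension
        fnew = f(x)
        slope = (fnew - fx) / step
        fx = fnew
        if slope < 0:                    # 2a. value decreased: keep going
            same_dir += 1
            if same_dir >= grow_after:   # feature 1: grow the step
                step /= step_change
                same_dir = 0
        elif slope < delim_slope and rounds_done >= 1:
            return x, fx, n              # 2c. flat: cease (feature 2: not
                                         #     before one full round)
        elif slope < slope_thresh:       # 2d. nearly flat: next dimension
            dim = (dim + 1) % ndim
            if dim == 0:                 # a round is complete
                base_step *= step_change
                rounds_done += 1
            step = base_step
            direction = random.choice([-1, 1])  # feature 4 again
            same_dir = 0
        else:                            # 2b. value increased: flip, shrink
            direction = -direction
            step *= step_change
            same_dir = 0
    return x, fx, max_steps              # delimiting number of steps reached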
To test this algorithm, I use it to find a minimum in a field of 6 negative 2D Gaussians. The Gaussians have heights between −1 and 0, FWHMs between 0 and 1, and centers between 0 and 1 in both dimensions. For optimization, the starting point is somewhere between 0 and 1 in both dimensions. The optimization is intended to take place between 0 and 1, and the only parameter that would be scale-dependent is the starting step-size.
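A sketch of how such a test field can be generated (again Python with my own names; the FWHM-to-sigma conversion is the standard one):

import math, random

def random_gaussian_field(n_peaks=6):
    # One random test landscape: a sum of n_peaks negative 2D Gaussians
    # with depths, FWHMs, and center coordinates all drawn from (0, 1).
    peaks = []
    for _ in range(n_peaks):
        depth = random.random()                      # height between 0 and -1
        fwhm = max(random.random(), 1e-12)           # guard against a zero draw
        sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))  # FWHM -> sigma
        cx, cy = random.random(), random.random()
        peaks.append((depth, sigma, cx, cy))
    def f(x):
        total = 0.0
        for depth, sigma, cx, cy in peaks:
            d2 = (x[0] - cx) ** 2 + (x[1] - cy) ** 2
            total -= depth * math.exp(-d2 / (2.0 * sigma ** 2))
        return total
    return f

A single trial is then just optimize(random_gaussian_field(), [random.random(), random.random()], ...).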
Examples of the optimization algorithm (figures):
(The title of each plot is N, the number of steps required to encounter the delimiting value change. Black corresponds to the minimum.)
Top: Different optima found during different trials with the same Gaussian field.
Bottom: (left) Normal result. (right) Number of steps maxed out during this trial.
Top and Bottom-left: Starting from the same place, 3 different peaks can be found.
Bottom: An example of the step-size being increased in order to find the minimum more quickly.
To measure the efficacy of a given set of optimization algorithm parameters, I must first specify a definition for fitness. For simplicity's sake, I decide to use the number of steps required to reach AN optimum. To test the general fitness of a given set of optimization parameters, I find the normalized mean of N (mean/max_N) over a specified number of randomly chosen Gaussian fields and starting positions (this is basically a Monte Carlo simulation of parameter fitness). I have a function that finds the normalized mean fitness for a given set of parameters. I pass this function to my optimization algorithm along with a set of transcendental optimization parameters (which do not change) and a starting point for the parameters that will be optimized.
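In code, the fitness function and the meta-optimization call might look roughly like this, building on the sketches above (the trial counts and the transcendental parameter values here are placeholders, not my actual settings):

def fitness(params, trials=100, max_steps=500):
    # Normalized mean of N (mean / max_steps) over random Gaussian
    # fields and random starting points. Lower is fitter.
    step0, step_change, slope_thresh = params
    total = 0
    for _ in range(trials):
        f = random_gaussian_field()
        x0 = [random.random(), random.random()]
        _, _, n = optimize(f, x0, step0, step_change, slope_thresh,
                           max_steps=max_steps)
        total += n
    return total / (trials * max_steps)

# Meta-optimization: the optimizer is pointed at its own fitness,
# with a fixed ("transcendental") parameter set. Note: nothing here
# keeps the probed parameters positive -- feature 3 above exists for
# exactly this reason.
best_params, best_fit, n_meta = optimize(
    fitness, [0.1, 0.5, 0.5],            # starting algorithm parameters
    step0=0.05, step_change=0.5, slope_thresh=0.1)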
A few example results of the meta-optimization algorithm:
(parameters: p=[starting step-size, step change, slope threshold])
Top-right: It found an optimum eventually.
Bottom-right: An example of an error: The most likely explanation is that the delimiting slope is encountered quickly because the starting step-size is so large that the probe immediately moves to an area where the slope is approximately 0 in all directions.
Future improvements
In many ways, this current algorithm is a toy problem with which to test the meta-optimization concept and get everything working on that front. There are many key problems that I need to address for a more complete/general algorithm:
1. I don't think that the same factor should be used for decreasing the step-size after a direction flip and after a round has been completed.
2. There are other values that I could use to define fitness. For example, I could use the probe height at the end of optimization. Instead of using a delimiting slope, I could use only a delimiting step number for all optimizations and then use the final slope as the fitness value.
3. The slope of a step can be below the delimiting slope if a large step is taken OVER a minimum, thus causing the optimization algorithm to cease prematurely. This currently happens too often for my comfort. The change mentioned above in (2) would fix this problem.
4. Is it better to do rough and fine optimization along one dimension before switching to the next, or would it be better to do rough optimization for all dimensions then finer for all?
5. The algorithm currently assumes that there is no noise. A better algorithm would, for example, stop searching for a peak when the change resulting from a step is less than the noise level. If the algorithm measures the noise level during optimization, it can decide to take more than one measurement per location in order to make more accurate decisions.
6. The algorithm can only remember one point. It uses the previous point to inform its next decision, but if it used the previous 2 or 3 points, it could calculate the rate of change of the slope in order to predict better next moves. As the number of points considered increases, the optimization algorithm approaches a discrete gradient-descent method.
7. Add a control algorithm that searches for the peak by picking points randomly. Comparison to a genetic algorithm would also be interesting.
8. What to do with the several good results for optimal parameters? I could use a genetic algorithm to breed these together.
9. Enable dynamic transcendental parameters. This would allow the meta-optimization algorithm to make use of better sets of parameters as it finds them.
10. Make current algorithm more general. If aspects of the underlying methods in place are truly better than others, then a more general algorithm will prove it.
11. Incorporate function types other than just sum-of-Gaussians. Lorentzians and cones would be good additions. Diagonal peaks should also be possible.
12. Diagonal steps should be possible. An intelligent algorithm should be able to move in two dimensions at once when it finds that optimization correlates with both.
13. How does the optimal starting step-size relate to the scale of the optimizable function? There should be a correlation.
14. What do you do after you get to the top of the mountain? You look for a taller mountain to climb. My algorithm should be able to do this eventually.
15. What effect does changing the number of dimensions have on the optimal optimization parameters?
16. It may be necessary to have different step parameters for each dimension. These would not be fixed general values, because it should always be possible to shuffle the dimensions and get the same result. Rather, the algorithm should be able to learn how motion in different dimensions affects the function value, thereby creating custom techniques for each dimension during optimization.
More to come. Thanks for reading.
Hixidom
UPDATE #1 6/6/2013
I spoke with a Mathematics professor today (in particular, the one who taught my Numerical Methods course back in the day). He made the following suggestions:
1. Use final height of probe to define fitness rather than the number of steps required to reach a particular point.
2. Learn more about the shape of the optimization function by holding all optimization parameters fixed but one and seeing how the fitness of the algorithm is related to the remaining parameter. I have been assuming that the optimization function and the Gaussian-sum function are similar, when this is likely not the case.
3. Try reducing the number of optimization parameters by defining, or finding through analysis, a reasonable relationship between two or more of the parameters.
UPDATE #2 6/15/2013
So, I changed the optimization algorithm so that it now takes a delimiting number of steps as input and gives the probe level reached after that many steps. The problem is that if I allow enough steps to actually find the minimum, almost all algorithms come so close to the minimum that the contrast between the results for different algorithms is less than the noise level inherent in the Monte Carlo simulation that is used for meta-optimization. More on this later.
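Relative to the earlier sketches, the changed fitness measure might look like this (the delimiting slope is disabled so that only the step budget ends a run; again, names and trial counts are placeholders):

def fitness_height(params, trials=100, n_steps=200):
    # Update #2 fitness: mean probe level reached after a fixed
    # number of steps, instead of the number of steps to converge.
    step0, step_change, slope_thresh = params
    total = 0.0
    for _ in range(trials):
        f = random_gaussian_field()
        x0 = [random.random(), random.random()]
        _, fx, _ = optimize(f, x0, step0, step_change, slope_thresh,
                            delim_slope=0.0, max_steps=n_steps)  # 0.0: never triggers
        total += fx
    return total / trials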
Visualization of algorithm parameter-space
The first thing I did in response to the mathematician's recommendations was to perform an experiment that fully explores the 3 algorithm parameters used. The result is a series of cross-sections showing algorithm fitness with respect to starting step-size and step change factor, with slices taken along the slope threshold axis:
The optimal point seems to occur at starting step-size=0.09, step change factor=0.8, slope threshold=0.52, so I take more detailed cross-sections along each parameter dimension at this optimal point:
Top-Left: A higher contrast plot of the optimal point as it appears in the GIF
First of all, regarding the "shape" of parameter-space: Just looking at the Top-Right plot, my first impression is that it is Gaussian in the step change factor direction and exponential in the starting step-size direction. I restrict starting step-size values to be positive in my meta-optimization algorithm, but this is probably not necessary. The other parameters, however, must be positive for the algorithm to work. I guess there's no harm in allowing negative parameters; they just won't converge. There will, however, be problems if, for example, the starting step-size were to end up being 0 somehow, since this would cause a division by 0 at one point in the algorithm. I should add a means of restricting bad parameters. In principle, the function to be optimized should be able to return an error result. In other words, it is not unreasonable for a function to have regions that cannot be probed for one reason or another; this trait should be programmed into functions from now on where applicable, with handling of such errors programmed into the optimization algorithm accordingly.
Another thing I notice, in the Bottom-Left plot, is that there is almost no dependence on slope threshold. This can be problematic, and it may be necessary to account for this type of difference in response along different dimensions by using separate slope threshold and step change factor values for each dimension being optimized for a given function.
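The brute-force scan behind these cross-sections can be sketched as follows (the grid ranges here are placeholders chosen around the optimum quoted above, not my actual scan settings):

def scan_parameter_space(steps0, changes, threshes, trials=50):
    # Evaluate fitness on a full 3D grid of algorithm parameters.
    # Returns one 2D slice (step-size x step-change) per slope threshold.
    slices = {}
    for thresh in threshes:
        slices[thresh] = [[fitness([s0, ch, thresh], trials=trials)
                           for ch in changes] for s0 in steps0]
    return slices

# e.g. a 20 x 20 x 10 grid spanning (0, 0.2] x (0, 1] x (0, 1]
slices = scan_parameter_space(
    [0.01 * i for i in range(1, 21)],
    [0.05 * i for i in range(1, 21)],
    [0.1 * i for i in range(1, 11)])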
Monte Carlo noise
There is still quite a bit of noise in these plots. How much? I take a 1D cross-section, at the optimal point, along the slope threshold axis, and find that the behavior of the algorithm is erratic. To determine whether this is deterministic behavior or simply noise, I increase the number of samples in each Monte Carlo mean to show a comparison:
Black dots are Monte Carlo mean. Red dots show upper and lower bound of standard error of mean.
What I learn from this is that noise has a HUGE effect on apparent fitness of the optimization algorithm at various points in parameter-space. There are 2 ways to tackle this problem that I can think of:
1. Find an alternative to Monte Carlo estimation of fitness. This is for the best, anyway, since a realistic optimizer does not have access to many random optimizable landscapes with which to test its algorithm. Rather, it simply applies its algorithm to the 1 landscape that is available and learns by trial and error. This change in my algorithm will mark a shift from Darwinian intelligence (evolution via stochastic methods) to conscious intelligence (evolution directed by conscious perceptions and analysis). This change is necessary in order to reach my eventual goal, which is to simulate human intelligence.
2. The optimization algorithm must be able to deal with noise in a reasonable way. Even after stochastic methods have been abandoned, measurements of the optimizable landscape itself are bound to come with some level of randomness. Analyzing the noise level and optimizing accordingly is necessary for navigating such a landscape (a sketch of the error-bar computation follows).
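The standard error of the mean shrinks only as 1/sqrt(n), so quadrupling the number of Monte Carlo samples merely halves the noise, which is why brute-force averaging is expensive. A minimal sketch, assuming the single-trial fitness from earlier:

import math

def mean_and_sem(samples):
    # Monte Carlo mean and standard error of the mean (s / sqrt(n)).
    n = len(samples)
    m = sum(samples) / n
    var = sum((x - m) ** 2 for x in samples) / (n - 1)
    return m, math.sqrt(var / n)

# e.g. the red dots above are m - sem and m + sem at each point,
# here evaluated at the optimum found in the scan:
runs = [fitness([0.09, 0.8, 0.52], trials=1) for _ in range(100)]
m, sem = mean_and_sem(runs)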