Needle in a haystack?

I’m trying to determine if AMPL is a good choice for solving the following kind of problem, if it is feasible then a pointer to an example that follows a similar format would be really appreciated.

A national political party is trying to determine which zip codes in the country should be targeted with national grant funding to elect their parties candidates. For each of the 100,000 zip codes and for each of the last ten relevant elections they have a spreadsheet containing around a million rows of data, like this…

Outcome Date as Int Zip F1 F2 F3
0 21914 90210 7534 3453 5534
1 23340 90210 7854 3456 5463

Where F1,F2 & F3 are factors like number of registered voters in zip, number of homes in zip, number of existing politicians from that party in that area etc. In practice there are more like 50 factors that they believe might be important. All of which are positive integer type data.

The outcome column is either 0 lost that election, or 1 won that election. What I am trying to do is ingest this large amount of data and then use it to obtain minimum and maximum values i.e. F1min, F1max, F2min, F2max, F3min, F3 max, such that if applied as constraints you would get the maximum number of 1’s in the outcome and EQUALLY IMPORTANT the minimum number of zeros.

The idea is that once these values are found they can be used to evaluate new areas that might hold elections in the future where depending on the min/max values of F1 to F3 the party will apply funding in the hope of a win, will decide that particular zip is not competitive for them and won’t spend money, or worst case apply funding and then loose.

I appreciate that this is a big problem and would be open to reducing the number of rows, by random sampling, or just throwing away older data. My hope is to get this model working with a smaller number of rows and then submit the whole thing to an online solver like NEOS which could hopefully handle more data. I’m using the Python interface at the moment, but would be happy to do it differently if that was a better approach. This is a hobby project, so I can’t pay for a commercial solver. Complete novice here, so I apologize if the answer should be obvious.

Thanks for any help,

Dan.

Your situation is like what is seen in logistic regression, where the goal is to construct a function that predicts from various inputs whether the output will be 0 or 1. Our Python notebook on logistic regression with amplpy gives an example, though it takes a different approach to the objective function that what you have in mind.

Here are some further observations:

  • You can maximize the number of 1s by selecting all rows, and you can minimize the number of 0s by selecting no rows. What you want is an objective that balances increasing the number of 1s and decreasing the number of 0s. That is the modeling part of the problem that you will need to work on.

  • When developing your model, you may want to consider AMPL’s direct spreadsheet interface, which reads from the spreadsheet file directly into AMPL sets and parameters.

  • The web interface to NEOS offers the greatest variety of solvers, but requires the data to be in AMPL text (.dat file) format. The kestrel interface to NEOS does not impose this data limitation, but works only with certain solvers.

Many thanks for taking a look at my problem. I’ll take a look at the example you suggested. Since I wrote the original post think I have a working model, in that it works correctly for a trivial row count, but when I try to submit the full data or even a reasonable sample to NEOS I hit the 3GB memory limit. At least 128GB desktops are now a thing, so maybe I’ll have to budget for that next year if the problem still seems interesting. Do you happen to know if Amazon Web Services (or one of the other cloud providers) has created any kind of pre-built virtual machine, intended for this sort of problem. I was thinking I could rent some capacity from them for a week or so, once I know my model is working well, though of course there is the issue of a commercial solver license for a limited time, which is why I was wondering if there were any pre-built VM’s out there?

Thanks again, Dan