The site spins for a long time until "Failed to start app CollEc"

Christian Zimmermann https://ideas.repec.org/zimm/
Christian Zimmermann writes
The site spins for a long time until "Failed to start app CollEc"
If CD needs to reboot, I should be in bed in about three hours from now.

-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21650th day.
CollEc does not work, I know. The app requires deeper maintenance. I do not know at what point I will take care of it. I have a lot of projects on my plate. And I promised my wife that I will change my lifestyle and make more time for her once she joins me in Australia.

What I could probably do next month is rewrite the code generating graph-theoretical statistics. Those could then be displayed on another site, such as IDEAS. However, fixing the app is a different story. It is written in R Shiny, which, with its hidden layer, is annoying to debug. If someone else volunteers to write a new app, I will happily remove the current one. Otherwise, there will be no new app until some time next year, after I have properly familiarized myself with React. Sorry.
Christian Düben writes
What I could probably do next month is rewrite the code generating graph-theoretical statistics. Those could then be displayed on another site, such as IDEAS.
Will this generate the paths and store them in a database? Then we could bring back the old site.

-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21650th day.
I was thinking more of the centrality measures, as they take up much less space on disk than the full paths do. How exactly would you want the paths to be stored? In Postgres arrays?
Christian Düben writes
How exactly would you want the paths to be stored?
Ideally in the form they would take in the ~/icanis/opt/paths files. These are potentially multiple binary paths. Out of these I would then eliminate those that are not shortest by a weighted criterion. [A sketch of this filtering step appears after this message.]
In Postgres arrays?
I prefer flat files because they are more exportable. If a system does not talk Postgres, it can't do much with the tables.

-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21651st day.
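A minimal sketch of the weighted tie-breaking described above, in C++ (the language of the new backend); the weight(a, b) callback is hypothetical, since the thread does not fix the criterion, and the function keeps only the binary shortest paths that are also minimal under that weighting:

    #include <cstddef>
    #include <functional>
    #include <limits>
    #include <vector>

    // Among equal-length binary shortest paths, keep those whose summed
    // edge weight is minimal under a caller-supplied weighting function.
    std::vector<std::vector<int>> filter_by_weight(
        const std::vector<std::vector<int>>& paths,
        const std::function<double(int, int)>& weight) {
        double best = std::numeric_limits<double>::infinity();
        std::vector<std::vector<int>> kept;
        for (const auto& p : paths) {
            double w = 0.0;
            for (std::size_t i = 1; i < p.size(); ++i)
                w += weight(p[i - 1], p[i]);
            if (w < best) { best = w; kept.clear(); }
            if (w == best) kept.push_back(p);
        }
        return kept;
    }

For instance, weight(a, b) could be the inverse of the number of joint papers of a and b, so that paths through stronger collaborations win the tie.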
What is the motivation for storing multiple paths per pair in the first place? How about I export one path per pair and weighting function?
Christian Düben writes
What is the motivation for storing multiple paths per pair in the first place?
There are frequently multiple binary paths of the same minimum length. Without a further criterion it is hard to say which one to show to the user as the most evident one. [A sketch of enumerating such tied paths appears after this message.]
How about I export one path per pair and weighting function?
If the weighting function is not binary, the result is likely to be meaningless to a user.

-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21651st day.
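To make the tie problem concrete: in an unweighted graph, all shortest paths between a pair can be enumerated with one breadth-first search followed by a walk back along distance-decreasing edges. A self-contained C++ sketch (not taken from any CollEc code), with integer author ids and an adjacency-list graph:

    #include <queue>
    #include <vector>

    // BFS distances from src; -1 marks unreachable nodes.
    std::vector<int> bfs(const std::vector<std::vector<int>>& adj, int src) {
        std::vector<int> dist(adj.size(), -1);
        std::queue<int> q;
        dist[src] = 0;
        q.push(src);
        while (!q.empty()) {
            int v = q.front(); q.pop();
            for (int w : adj[v])
                if (dist[w] == -1) { dist[w] = dist[v] + 1; q.push(w); }
        }
        return dist;
    }

    // Walk back from dst to src along edges that decrease the distance
    // by one; every such walk is a distinct shortest path.
    void collect(const std::vector<std::vector<int>>& adj,
                 const std::vector<int>& dist, int v, int src,
                 std::vector<int>& path, std::vector<std::vector<int>>& out) {
        path.push_back(v);
        if (v == src)
            out.emplace_back(path.rbegin(), path.rend());
        else
            for (int w : adj[v])
                if (dist[w] == dist[v] - 1)
                    collect(adj, dist, w, src, path, out);
        path.pop_back();
    }

    int main() {
        // Diamond graph 0-1, 0-2, 1-3, 2-3: two tied shortest paths,
        // 0-1-3 and 0-2-3, between authors 0 and 3.
        std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
        auto dist = bfs(adj, 0);
        std::vector<int> path;
        std::vector<std::vector<int>> paths;
        collect(adj, dist, 3, 0, path, paths);  // paths = {{0,1,3}, {0,2,3}}
    }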
Do you mean unweighted, when you say binary weighting? Two people are either coauthors or they are not?
Christian Düben writes
Do you mean unweighted, when you say binary weighting?
Yes.
Two people are either coauthors or they are not?
Genau! (Exactly!)

-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21651st day.
I have uploaded a first draft to GitHub: https://github.com/repec-org/collec/blob/transition/cppbackend/graph_results.... It is untested. So, there could well still be bugs in it. I will test it tomorrow. It is past 1 am here.
Christian Düben writes
I have uploaded a first draft to GitHub: https://github.com/repec-org/collec/blob/transition/cppbackend/graph_results....
Smashing.
It is untested. So, there could well still be bugs in it.
I presume the bulk of this is copied from some other source.
I will test it tomorrow. It is past 1 am here.
Remember your promises to your wife. I will set off to my lover soon.

-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21651st day.
I actually wrote this from scratch. That is also why it needs proper testing.
Christian Düben writes
I actually wrote this from scratch. That is also why it needs proper testing.
I stopped updates of the incoming data. I made a run for your node:

icanis@helos:~$ date ; ~/icanis/perl/update_paths ras biwe pdb4 ; date
Sat 14 Sep 06:39:20 UTC 2024
start doing pdb4 ... The End
Sat 14 Sep 06:55:09 UTC 2024

The files are here:

icanis@helos:~/icanis/opt/paths/ras/biwe/d/b/4$ ls -lrt
total 4408
-rw-rw-r-- 1 icanis icanis 2261317 Sep 14 06:55 paths
-rw-rw-r-- 1 icanis icanis  153359 Sep 14 06:55 inter
-rw-rw-r-- 1 icanis icanis 2091285 Sep 14 06:55 notes

If you create a similar structure, at least for the paths, I can check whether the result is the same.

-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21652nd day.
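A minimal C++17 sketch for recreating that directory layout with std::filesystem; the internal record format of the paths, inter, and notes files is not specified in this thread, so only empty placeholder files are created, under an illustrative root:

    #include <filesystem>
    #include <fstream>

    int main() {
        namespace fs = std::filesystem;
        // Illustrative root; the real tree lives under ~/icanis/opt/paths.
        const fs::path dir = "paths/ras/biwe/d/b/4";
        fs::create_directories(dir);
        for (const char* name : {"paths", "inter", "notes"})
            std::ofstream{dir / name};  // placeholder: real records go here
    }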
Maybe the web site is failing because we have too many containers running.

krichel@helos:~$ ps axf | grep -c shim-runc
74

Now the box appears to be swapping and I am having a hard time running my mail.

-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21653rd day.
Yes, sorry. I started the C++ code and then realized that the containers are occupying a lot of RAM. So, I am currently removing containers and running the C++ code at the same time. The containers will be gone within a few minutes.
After hours of bug fixing, I have now uploaded a supposedly correct version of the code. To facilitate maintenance by future volunteers, it has a bunch of comments. I am currently running a test on the full input data.

For performance reasons, threads write to their own files. This way, I can use parallelism without locks. If you prefer all paths, distances, and closeness centrality values each to be in a single file instead of thread-specific files, I can change that. However, that probably slows down the program's execution.

All shortest paths for an author pair are not necessarily stored consecutively. A paths file might contain the first shortest path from author 1 to author 2, followed by the first shortest path from author 1 to author 4, followed by the second shortest path from author 1 to author 2. I can order them if needed, again at a performance penalty.
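A minimal sketch of this lock-free, file-per-thread scheme, assuming std::thread workers; the file names, author count, and per-author work are illustrative, not an excerpt from the actual CollEc backend:

    #include <algorithm>
    #include <fstream>
    #include <string>
    #include <thread>
    #include <vector>

    int main() {
        const unsigned n_threads =
            std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < n_threads; ++t) {
            workers.emplace_back([t, n_threads] {
                // One output file per thread: no locks are needed because
                // no two threads ever write to the same file.
                std::ofstream out("paths_" + std::to_string(t) + ".txt");
                // Interleaved partition: thread t handles authors
                // t, t + n_threads, t + 2 * n_threads, ...
                for (unsigned author = t; author < 1000; author += n_threads)
                    out << "placeholder result for author " << author << "\n";
            });
        }
        for (auto& w : workers) w.join();  // one result shard per thread
    }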
Christian Düben writes
For performance reasons, threads write to their own files.
I am not sure what threads are and why we need them here. All I need is to have the paths from one author to all others in a file. These can all be run in parallel. In your run, you seem to try to do all authors at the same time. This puts a great strain on the machine. I suggest calculating one author at a time, using parallel processing, based on a database recording when author data has been changed.
This way, I can use parallelism without locks. If you prefer all paths, distances, and closeness centrality values each to be in a single file instead of thread-specific files, I can change that. However, that probably slows down the program's execution.
This massive parallel way of handling the job makes no sense to me.
All shortest paths for an author pair are not necessarily stored consecutively. A paths file might contain the first shortest path from author 1 to author 2, followed by the first shortest path from author 1 to author 4, followed by the second shortest path from author 1 to author 2. I can order them if needed, again at a performance penalty.
This makes no sense to me. This is not how I built the old CollEc. I ran a system that took nodes and updated them. Then I could run updates around the clock, and I could run as many processes as I have machine capacity for. That is a completely different approach from what you are trying, which is to make a complete calculation every now and then.

Now the machine is so slow that I can hardly use it.

It would be better to solve the task at hand, which is to create a fast program to do binary paths for an individual author. I can then take this up and try to resuscitate the old site.

-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21653rd day.
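A hedged sketch of the per-author update described here: one BFS per origin, rewriting only that author's file, so nodes can be refreshed one at a time around the clock. The integer ids, adjacency list, and file layout are illustrative (the real files live under ~/icanis/opt/paths), and unlike the old design the sketch keeps only one representative shortest path per target:

    #include <fstream>
    #include <queue>
    #include <string>
    #include <vector>

    void update_author(const std::vector<std::vector<int>>& adj, int origin,
                       const std::string& dir) {
        // BFS from the origin, remembering one predecessor per node.
        std::vector<int> dist(adj.size(), -1), parent(adj.size(), -1);
        std::queue<int> q;
        dist[origin] = 0;
        q.push(origin);
        while (!q.empty()) {
            int v = q.front(); q.pop();
            for (int w : adj[v])
                if (dist[w] == -1) {
                    dist[w] = dist[v] + 1;
                    parent[w] = v;
                    q.push(w);
                }
        }
        // One file per origin: a space-separated node list per reachable
        // author, overwritten whenever this origin is recalculated.
        std::ofstream out(dir + "/" + std::to_string(origin) + ".paths");
        for (int v = 0; v < static_cast<int>(adj.size()); ++v) {
            if (dist[v] <= 0) continue;  // skip unreachable nodes and the origin
            std::vector<int> path;
            for (int u = v; u != -1; u = parent[u]) path.push_back(u);
            for (auto it = path.rbegin(); it != path.rend(); ++it)
                out << *it << (it + 1 == path.rend() ? '\n' : ' ');
        }
    }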
You told me that you did not want to use a database. You said you wanted it written to text files. Text files are not a database.

Honestly, precomputing all shortest paths is a terrible idea. It is unnecessarily inefficient. Centrality measures need to be computed beforehand, but paths should be derived during user sessions. All paths taken together occupy hundreds of GB on disk. The best way would be to store the data in Neo4j and update the database based on messages to an API. But there is no API. CollEc's input is an XML file, which does not even come with a change log; it is just the full data set.

I can reduce the number of threads, i.e. the number of workers running in parallel, if the load is too heavy. RAM utilization is already minimal. The new code is the most performant program any version of CollEc has ever seen.

I have sacrificed multiple days to craft this piece of software exactly to your demands. You now have the binary paths for individual authors. You have distance values and you have closeness centrality results. Everything is stored in the requested antique output formats. All I get in return is insults. First, I am accused of not writing the code myself. Then, you complain about system design despite it meeting exactly the requirements. If you had told me before that you do not want me to implement this, it would have saved me a lot of work.

Just do it yourself. Write it in perl, cobol, or whatever. I am out. This was my last contribution to CollEc.
Christian Düben writes
I have sacrificed multiple days to craft this piece of software exactly to your demands.
I offered to go back to the original site since your site does not work at this time. What I suggested was for you to write the path files, since my perl is very slow. It implies having a system command that writes the binary paths in the places that I indicated, a system command to produce a simple path file. It is not about calculating all the values. They are aggregated from the path files. This is the old design. It is very different from yours.
You now have the binary paths for individual authors. You have distance values and you have closeness centrality results. Everything is stored in the requested antique output formats.
It's not in the individual files, and it is not updatable on a per-person, per-file basis. This is a very different approach. My approach was never meant to calculate the correct result. Instead, it calculated an approximation that would approach the correct result if the network does not change. You want to take snapshots and then calculate the correct result for them. You would take them, say, every week or so. This is an approach that is completely different from mine, where such discrete periods don't exist.
All I get in return is insults.
I am sorry. I surely did not mean to insult you and I don't think I did.
First, I am accused of not writing the code myself.
I was simply supposing you would have taken the algorithm from somewhere, that's all. It is what I did.
Then, you complain about system design despite it meeting exactly the requirements.
I am not saying that my design is superior to yours. It is just that if I want to pick up things from the old design, I need the things in the old state.
If you had told me before that you do not want me to implement this, it would have saved me a lot of work.
I said I need individual path files. This makes no sense to you, but it is the old design. I'm all in favour of doing something different and, in the eyes of a modern viewer, better. But it needs to actually run.
Just do it yourself. Write it in perl, cobol, or whatever. I am out. This was my last contribution to CollEc.
You are the person who runs CollEc. I don't do it anymore. I was proposing to help with a stop-gap measure. I hope you will reconsider. As I ought to have mentioned before, this would be better discussed on the phone.

-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21653rd day.
Christian Düben writes
Honestly, precomputing all shortest paths is a terrible idea. It is unnecessarily inefficient.
This depends on what we say the aim is. I always thought the aim is for folks to see the path: here is your path to some other economist.
Centrality measures need to be computed beforehand, but paths should be derived during user sessions.
If you say so, you must be right. But my design does not suit this thinking. It was built on the idea of path first, centrality second. You think the opposite. I lack the knowledge to ascertain whether my approach or yours is better. I suspect it is a matter of business case. [A standard definition of closeness centrality appears after this message.]
All paths taken together occupy hundreds of GB on disk.
I am by no means a specialist in this, but the problem is not disk space. The problem is computing time.
The best way would be to store the data in Neo4j and update the database based on messages to an API.
I don't know what Neo4j is but, yes, I think that is correct. We want to calculate new paths on demand when we think that something has changed. I am not a specialist in this area; that's why I used my admittedly primitive but robust approach. As I look at Neo4j, I see it's a commercial offering, which is likely to lead to funding problems down the line.
But there is no API. CollEc's input is an XML file, which does not even come with a change log; it is just the full data set.
If you say what the changelog should be, we can build one.
I can reduce the number of threads, i.e. the number of workers running in parallel, if the load is too heavy. RAM utilization is already minimal. The new code is the most performant program any version of CollEc has ever seen.
Yes, it would need to run continuously and write the paths in files per origin.
I have sacrificed multiple days to craft this piece of software exactly to your demands. You now have the binary paths for individual authors.
In a bunch of aggregates that are 100G each (?), which I then have to parse, but when?
You have distance values and you have closeness centrality results. Everything is stored in the requested antique output formats.
I can try to write software that tries to compile my path files from your output.

-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21653rd day.
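For reference, the closeness centrality discussed throughout this thread is conventionally computed from exactly the distances the path runs produce; this is the standard textbook definition, and the thread does not say which variant the CollEc code uses:

    c(v) = \frac{n - 1}{\sum_{u \neq v} d(v, u)}

where d(v, u) is the binary shortest-path distance between authors v and u, and n is the number of authors reachable from v (including v). A higher value means the author is, on average, fewer coauthorship steps away from everyone else.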
participants (4)

- Christian Düben
- Christian Düben
- Christian Zimmermann
- Thomas Krichel