You told me that you did not want to use a database; you said you wanted the data written to text files. Text files are not a database. Honestly, precomputing all shortest paths is a terrible idea: it is unnecessarily inefficient. Centrality measures need to be computed beforehand, but paths should be derived during user sessions. Taken together, all paths occupy hundreds of GB on disk. The best approach would be to store the data in Neo4j and update the database based on messages to an API. But there is no API. CollEc's input is an XML file, which does not even come with a change log, just the full data set.

I can reduce the number of threads, i.e. the number of workers running in parallel, if the load is too heavy. RAM utilization is already minimal. The new code is the most performant program any version of CollEc has ever seen. I sacrificed multiple days to craft this piece of software exactly to your demands. You now have the binary paths for individual authors, the distance values, and the closeness centrality results, all stored in the requested antique output formats.

All I get in return is insults. First, I am accused of not writing the code myself. Then you complain about the system design despite it meeting exactly the requirements you stated. If you had told me beforehand that you did not want me to implement this, it would have saved me a lot of work. Just do it yourself: write it in Perl, COBOL, or whatever. I am out. This was my last contribution to CollEc.

On Monday, September 16th, 2024 at 00:39, Thomas Krichel <krichel@openlib.org> wrote:
Christian Düben writes
For performance reasons, threads write to their own files.
I am not sure what threads are or why we need them here. All I need is a file with the paths from one author to all others. These can all be run in parallel. In your run, you seem to try to do all authors at the same time, which puts a great strain on the machine. I suggest calculating one author at a time, using parallel processing in a database, based on when author data has been changed.
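Calculating one author at a time amounts to a single breadth-first search per source node. A minimal sketch of that approach, assuming an in-memory adjacency dict keyed by author id (all names here are illustrative, not CollEc's actual code):

```python
from collections import deque

def shortest_paths_from(graph, source):
    # BFS from one author: hop distances and a parent pointer per reachable author.
    # `graph` maps author id -> list of coauthor ids (illustrative structure).
    dist, parent = {source: 0}, {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                parent[v] = u
                queue.append(v)
    return dist, parent

def path_to(parent, target):
    # Walk parent pointers back to the source to recover one shortest path.
    path = []
    while target is not None:
        path.append(target)
        target = parent[target]
    return path[::-1]

def closeness(dist):
    # Closeness centrality of the source within its reachable component.
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0
```

One such run yields everything the site needs for that author: distances, one path per coauthor, and the closeness value, so per-author updates can be scheduled independently.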
This way, I can use parallelism without locks. If you prefer all paths, distances, and closeness centrality values to each be in a single file instead of thread-specific files, I can change that. However, that would probably slow down the program's execution.
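The lock-free pattern described here can be sketched as follows; the file naming and the round-robin split of authors across workers are assumptions for illustration, not the actual implementation:

```python
import threading

def worker(worker_id, authors, out_dir):
    # Each worker writes only to its own file, so no locking is needed.
    with open(f"{out_dir}/paths_{worker_id}.txt", "w") as out:
        for author in authors:
            out.write(f"{author}\n")  # stand-in for the author's path records

def run_workers(authors, n_workers, out_dir):
    # Round-robin split: worker i handles authors i, i+n, i+2n, ...
    threads = [threading.Thread(target=worker, args=(i, authors[i::n_workers], out_dir))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Merging the per-worker files into one afterwards is a cheap sequential pass, which is why a single combined output file mainly costs time during the parallel phase.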
This massive parallel way of handling the job makes no sense to me.
All shortest paths between an author pair are not necessarily stored consecutively. A paths file might contain the first shortest path from author 1 to author 2, followed by the first shortest path from author 1 to author 4, followed by the second shortest path from author 1 to author 2. I can order them if needed, again at a performance penalty.
This makes no sense to me. It is not how I built the old CollEc. I ran a system that took nodes and updated them. That way I could run updates around the clock, with as many processes as the machine has capacity for. That is a completely different approach from yours, which is to redo the complete calculation every now and then.
Now the machine is so slow that I can hardly use it.
It would be better to solve the task at hand, which is to create a fast program to produce binary paths for an individual author. I can then take this up and try to resuscitate the old site.
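A per-author run could emit its paths in a compact binary layout. This sketch assumes an illustrative record format (three uint32 header fields, then the node ids), not CollEc's actual one:

```python
import struct

def write_paths_binary(path_records, filename):
    # One record per path: source, target, length (uint32 each), then node ids.
    with open(filename, "wb") as f:
        for path in path_records:
            f.write(struct.pack("<III", path[0], path[-1], len(path)))
            f.write(struct.pack(f"<{len(path)}I", *path))

def read_paths_binary(filename):
    # Read records back until the file is exhausted.
    records = []
    with open(filename, "rb") as f:
        while header := f.read(12):
            _src, _dst, n = struct.unpack("<III", header)
            records.append(list(struct.unpack(f"<{n}I", f.read(4 * n))))
    return records
```

Fixed-width integer records keep the per-author files small and make them readable from any language, which matters if the consuming side is written in Perl or something else entirely.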
-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21653rd day.