You told me that you did not want to use a database; you said you wanted the data written to text files. Text files are not a database. Honestly, precomputing all shortest paths is a terrible idea: it is unnecessarily inefficient. Centrality measures need to be computed beforehand, but paths should be derived during user sessions. Taken together, all paths occupy hundreds of GB on disk. The best approach would be to store the data in Neo4j and update the database based on messages to an API. But there is no API. CollEc's input is an XML file, which does not even come with a change log, just the full data set.

I can reduce the number of threads, i.e. the number of workers running in parallel, if the load is too heavy. RAM utilization is already minimal. The new code is the most performant program any version of CollEc has ever seen. I sacrificed multiple days to craft this piece of software exactly to your demands. You now have the binary paths for individual authors, the distance values, and the closeness centrality results, all stored in the requested antique output formats.

All I get in return is insults. First, I am accused of not writing the code myself. Then you complain about the system design despite it meeting exactly the requirements you stated. If you had told me beforehand that you did not want me to implement this, it would have saved me a lot of work. Just do it yourself: write it in Perl, COBOL, or whatever. I am out. This was my last contribution to CollEc.

On Monday, September 16th, 2024 at 00:39, Thomas Krichel <krichel@openlib.org> wrote:
Christian Düben writes
For performance reasons, threads write to their own files.
I am not sure what threads are or why we need them here. All I need is a file with the paths from one author to all others. These can all be run in parallel. In your run, you seem to try to do all authors at the same time, which puts a great strain on the machine. I suggest calculating one author at a time, using parallel processing in a database, based on when author data has been changed.
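Calculating one author at a time amounts to a single breadth-first search per source node. A minimal sketch of that approach, assuming an in-memory adjacency dict keyed by author id (all names here are illustrative, not CollEc's actual code):

```python
from collections import deque

def shortest_paths_from(graph, source):
    # BFS from one author: hop distances and a parent pointer per reachable author.
    # `graph` maps author id -> list of coauthor ids (illustrative structure).
    dist, parent = {source: 0}, {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                parent[v] = u
                queue.append(v)
    return dist, parent

def path_to(parent, target):
    # Walk parent pointers back to the source to recover one shortest path.
    path = []
    while target is not None:
        path.append(target)
        target = parent[target]
    return path[::-1]

def closeness(dist):
    # Closeness centrality of the source within its reachable component.
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0
```

One such run yields everything the site needs for that author: distances, one path per coauthor, and the closeness value, so per-author updates can be scheduled independently.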
This way, I can use parallelism without locks. If you prefer all paths, distances, and closeness centrality values to each be in a single file instead of thread-specific files, I can change that. However, that would probably slow down the program's execution.
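The lock-free pattern described here can be sketched as follows; the file naming and the round-robin split of authors across workers are assumptions for illustration, not the actual implementation:

```python
import threading

def worker(worker_id, authors, out_dir):
    # Each worker writes only to its own file, so no locking is needed.
    with open(f"{out_dir}/paths_{worker_id}.txt", "w") as out:
        for author in authors:
            out.write(f"{author}\n")  # stand-in for the author's path records

def run_workers(authors, n_workers, out_dir):
    # Round-robin split: worker i handles authors i, i+n, i+2n, ...
    threads = [threading.Thread(target=worker, args=(i, authors[i::n_workers], out_dir))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Merging the per-worker files into one afterwards is a cheap sequential pass, which is why a single combined output file mainly costs time during the parallel phase.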
This massive parallel way of handling the job makes no sense to me.
All shortest paths between an author pair are not necessarily stored consecutively. A paths file might contain the first shortest path from author 1 to author 2, followed by the first shortest path from author 1 to author 4, followed by the second shortest path from author 1 to author 2. I can order them if needed, again at a performance penalty.
This makes no sense to me. It is not how I built the old CollEc. I ran a system that took nodes and updated them. That way I could run updates around the clock, with as many processes as the machine has capacity for. That is a completely different approach from yours, which is to redo the complete calculation every now and then.
Now the machine is so slow that I can hardly use it.
It would be better to solve the task at hand, which is to create a fast program to produce binary paths for an individual author. I can then take this up and try to resuscitate the old site.
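A per-author run could emit its paths in a compact binary layout. This sketch assumes an illustrative record format (three uint32 header fields, then the node ids), not CollEc's actual one:

```python
import struct

def write_paths_binary(path_records, filename):
    # One record per path: source, target, length (uint32 each), then node ids.
    with open(filename, "wb") as f:
        for path in path_records:
            f.write(struct.pack("<III", path[0], path[-1], len(path)))
            f.write(struct.pack(f"<{len(path)}I", *path))

def read_paths_binary(filename):
    # Read records back until the file is exhausted.
    records = []
    with open(filename, "rb") as f:
        while header := f.read(12):
            _src, _dst, n = struct.unpack("<III", header)
            records.append(list(struct.unpack(f"<{n}I", f.read(4 * n))))
    return records
```

Fixed-width integer records keep the per-author files small and make them readable from any language, which matters if the consuming side is written in Perl or something else entirely.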
-- Written by Thomas Krichel http://openlib.org/home/krichel on his 21653rd day.